Citation Segmentation from Sparse & Noisy Data: An Unsupervised Joint Inference Approach with Markov Logic Networks

Authors

  • Dustin Heckmann
  • Anette Frank
  • Matthias Arnold
  • Peter Gietz
  • Christian Roth
Abstract

Citation Segmentation in a Digital Humanities Context. Bibliographies are an important resource for scientific research. Their storage in (online) bibliographic databases offers efficient search functionalities for widespread and timely use in international research communities. For this purpose, it is crucial to detect the inherent structure of bibliographic references automatically by isolating and extracting citation subfields (e.g., author, title, venue). Previous approaches to citation segmentation rely strongly on language-specific lexical data and on multiple occurrences of the same citation entry in online publication repositories. However, when dealing with multilingual data, the use of language-specific knowledge becomes difficult. Moreover, self-contained data sources like printed bibliographies are naturally short of recurring citation entries, and thus offer no data redundancy to exploit. In this work, we present an approach to citation segmentation that operates on sparse and noisy OCR input originating from a single, multilingual bibliography, the Turkology Annual (Turkologischer Anzeiger). The Turkology Annual is a bibliography for Turkology and Ottoman studies, comprising 28 volumes which are only available in print. Citation entries containing multiple languages and scripts, the shortage of citation redundancy, frequent OCR errors, and inconsistencies in citation structure impede the use of state-of-the-art statistical approaches for citation segmentation.


Similar resources

A Generalized Joint Inference Approach for Citation Matching

Citation matching is the problem of extracting bibliographic records from citation lists in technical papers, and merging records that represent the same publication. Generally, there are three types of datasets in citation matching, i.e., sparse, dense and hybrid types. Typical approaches for citation matching are Joint Segmentation (Jnt-Seg) and Joint Segmentation Entity Resolution (Jnt-Seg-E...

Joint Unsupervised Coreference Resolution with Markov Logic

Machine learning approaches to coreference resolution are typically supervised, and require expensive labeled data. Some unsupervised approaches have been proposed (e.g., Haghighi and Klein (2007)), but they are less accurate. In this paper, we present the first unsupervised approach that is competitive with supervised ones. This is made possible by performing joint inference across mentions, i...

Joint Inference in Information Extraction

The goal of information extraction is to extract database records from text or semi-structured sources. Traditionally, information extraction proceeds by first segmenting each candidate record separately, and then merging records that refer to the same entities. While computationally efficient, this approach is suboptimal, because it ignores the fact that segmenting one candidate record can hel...

Semantic analysis of spoken input using Markov logic networks

We present a semantic analysis technique for spoken input using Markov Logic Networks (MLNs). MLNs combine graphical models with first-order logic. They are particularly suitable for providing inference in the presence of inconsistent and incomplete data, which are typical of an automatic speech recognizer’s (ASR) output in the presence of degraded speech. The target application is a speech int...

Knowledge-leveraged Computational Thinking through Natural Language Processing and Statistical Logic (NII Shonan Meeting 2011-4)

This talk describes a recent effort on the development of a textual entailment data set. Rather than assuming a sub-component of applications like question answering and multi-document summarization, we focus on a real-world task to judge whether a natural language proposition is true or false according to a given text. I will describe the design of resource development and features of the obtai...

Publication date: 2013